Delegating tasks between multiple processor cores

ABSTRACT

An electronic device comprising a first processor and a second processor, the second processor coupled to the first processor and adapted to receive an address from the first processor, to pause execution of a first thread at a switch point, and to use the address to retrieve and execute a group of instructions in a second thread. Prior to executing the group of instructions in the second thread, the second processor pushes onto a hardware-controlled stack data pertaining to the switch point, the data comprising information needed to resume execution of the first thread at the switch point.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to European Patent Application No. 04291918.3, filed on Jul. 27, 2004 and incorporated herein by reference. This application is related to co-pending and commonly assigned applications Ser. No. ______ (Attorney Docket No. TI-38581 (1962-22200)), entitled, “Emulating A Direct Memory Access Controller,” and Ser. No. ______ (Attorney Docket No. TI-38584 (1962-22500)), entitled, “Interrupt Management In Dual Core Processors,” which are incorporated by reference herein.

BACKGROUND

Many systems comprise dual processor cores. One of these processor cores is typically designated to be the “host,” or main, processor. The other processor may be termed a “secondary” processor. While performing a series of tasks, the host processor may determine that delegating one or more tasks to the secondary processor would be expeditious, so that the host processor may allocate its resources for performing other tasks. In such a case, the host processor must program the secondary processor to perform the task or tasks that are to be delegated. For example, if the host processor delegates the execution of a particular algorithm to the secondary processor, the host processor must program the secondary processor to execute the algorithm. It is time-consuming and energy-consuming for a host processor to have to program the secondary processor.

BRIEF SUMMARY

Disclosed herein is a technique for delegating tasks between multiple processor cores. An illustrative embodiment comprises an electronic device comprising a first processor and a second processor, the second processor coupled to the first processor and adapted to receive an address from the first processor, to pause execution of a first thread at a switch point, and to use the address to retrieve and execute a group of instructions in a second thread. Prior to executing the group of instructions in the second thread, the second processor pushes onto a hardware-controlled stack data pertaining to the switch point, the data comprising information needed to resume execution of the first thread at the switch point.

Another illustrative embodiment comprises a processor that comprises decode logic adapted to receive from another processor an address of a group of instructions. The processor also comprises fetch logic coupled to the decode logic and adapted to fetch the group of instructions from storage. The decode logic pauses processing of a first thread at a switch point and processes the group of instructions in a separate thread. Prior to processing the group of instructions, the processor pushes onto a hardware-controlled stack data pertaining to the switch point, the data comprising contents of registers used by the group of instructions.

Yet another illustrative embodiment comprises a method of delegating a task from a first processor to a second processor. The method comprises transferring an address of a group of instructions from the first processor to the second processor, pausing execution of a first thread in the second processor at a switch point, and pushing data onto a stack, the data comprising contents of registers used by the group of instructions. The method further comprises retrieving the group of instructions using the address, executing the group of instructions in a second thread, and popping the data off of the stack and storing the data to the registers in the second processor.

NOTATION AND NOMENCLATURE

Certain terms are used throughout the following description and claims to refer to particular system components. As one skilled in the art will appreciate, companies may refer to a component by different names. This document does not intend to distinguish between components that differ in name but not function. In the following discussion and in the claims, the terms “including” and “comprising” are used in an open-ended fashion, and thus should be interpreted to mean “including, but not limited to . . . ”. Also, the term “couple” or “couples” is intended to mean either an indirect or direct connection. Thus, if a first device couples to a second device, that connection may be through a direct connection, or through an indirect connection via other devices and connections.

BRIEF DESCRIPTION OF THE DRAWINGS

For a more detailed description of the preferred embodiments of the present invention, reference will now be made to the accompanying drawings, wherein:

FIG. 1 shows a diagram of a system including a Java Stack Machine (“JSM”) and a Main Processor Unit (“MPU”), in accordance with preferred embodiments of the invention;

FIG. 2 shows a block diagram of the JSM of FIG. 1 in accordance with preferred embodiments of the invention;

FIG. 3 shows various registers used in the JSM of FIGS. 1 and 2, in accordance with embodiments of the invention;

FIG. 4 shows the preferred operation of the JSM to include “micro-sequences,” in accordance with embodiments of the invention;

FIG. 5 shows an illustrative switching process between two execution threads, in accordance with a preferred embodiment of the invention;

FIG. 6 shows an illustrative 32-bit instruction that may be incorporated into a micro-sequence, in accordance with a preferred embodiment of the invention;

FIG. 7 shows a flow diagram of the switching process of FIG. 5, in accordance with embodiments of the invention;

FIG. 8 shows a flow diagram describing a delegation technique in accordance with a preferred embodiment of the invention; and

FIG. 9 shows the system described herein, in accordance with preferred embodiments of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The following discussion is directed to various embodiments of the invention. Although one or more of these embodiments may be preferred, the embodiments disclosed should not be interpreted, or otherwise used, as limiting the scope of the disclosure, including the claims, unless otherwise specified. In addition, one skilled in the art will understand that the following description has broad application, and the discussion of any embodiment is meant only to be exemplary of that embodiment, and not intended to intimate that the scope of the disclosure, including the claims, is limited to that embodiment.

Described herein is a technique by which a host processor may delegate a task to a secondary processor by simply sending a command and an address to the secondary processor. The command causes the secondary processor to use the address to locate and retrieve a group of instructions that has been pre-programmed into the secondary processor. Executing this group of instructions causes the secondary processor to perform whatever task the host processor delegated to the secondary processor. However, before the secondary processor executes the group of instructions, it must first stop what it is doing in a currently executing thread and must further “bookmark” its place in the currently executing thread. By bookmarking its place in the currently executing thread, the secondary processor can execute the group of instructions and then resume executing in the thread at the bookmarked location. Accordingly, a technique for bookmarking a spot in a thread and a technique for delegating tasks from the host processor to the secondary processor are now discussed in turn.

In the context of software code, a “thread” may be defined as a single stream of code execution. While executing a software program, a processor may switch from a first thread to a second thread in order to complete a particular task. For example, the first thread may comprise some stimulus (i.e., instruction) that, when executed by the processor, causes the processor to halt execution of the first thread and to begin execution of the second thread. The second thread may comprise the performance of some task by a different portion of the software program.

The point in the first thread at which the switch is made may be termed the “switch point.” When switching from the first thread to the second thread, the processor first “bookmarks” the switch point, so that when the processor has finished executing the second thread of code, it can resume execution in the first thread at the switch point.

In order to bookmark the switch point, the processor stores all information that pertains to the switch point (known as the “context” of the switch point). Such information includes all registers, the program counter, pointer to the stack, etc. The processor copies such information to memory and retrieves the information later to resume execution in the first thread at the switch point. Bookmarking the switch point is time-consuming and consumes power which may be in limited supply in, for example, a battery-operated device such as a mobile phone.

Processors that store to memory all information pertaining to the switch point unnecessarily spend time and power doing so. Whereas the aforementioned processors store all registers, the program counter, stack pointer, etc., the subject matter described herein is achieved at least in part by the realization that in many cases, less than all such information need be stored. For example, saving only three values may suffice to bookmark the switch point: the program counter (PC), a second program counter called the micro-program counter (μPC), discussed below, and a status register. Once the processor has finished executing the second thread, these three values provide sufficient information for the processor to find the switch point in the first thread and resume execution at that switch point.

Accordingly, described herein is a programmable electronic device, such as a processor, that is able to bookmark a switch point using a minimal amount of information pertaining to the switch point. A “minimal” amount of information generally comprises information in one or more registers, but not all registers, of a processor core. For example, in some embodiments, a “minimal” amount of information comprises a PC register, a μPC register and a status register. In other embodiments, a “minimal” amount of information comprises the PC register, the μPC register and the status register, as well as one or more additional registers, but less than all registers. In still other embodiments, a “minimal” amount of information comprises less than all registers. In yet other embodiments, a “minimal” amount of information consists of only the information (i.e., registers) necessary to bookmark a switch point, where the amount of information (i.e., number of registers) varies depending on the processor used and/or the software application being processed. In such cases, the “minimal” amount of information may simply be one register or may be all of the registers in the processor core. Instead of storing all switch point information to memory, the processor described herein pushes a minimal amount of switch point information onto a processor stack. Later, when the processor needs the switch point information, it pops the information off of the stack and uses the information to resume execution at the switch point. In this way, the time and power demands placed on the processor are reduced or even minimized, resulting in increased performance.

Some situations, however, require more than a minimal amount of information to be stored. For example, in these situations, a minimal amount of information may not be sufficient to properly bookmark a switch point. Accordingly, the disclosed processor is capable of bookmarking a switch point using a minimal amount of information (“minimal context store”) needed to resume execution at the switch point. The processor also is capable of bookmarking a switch point using more than a minimal amount of information (“full context store”), as described further below.
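By way of illustration only, the minimal context store described above may be sketched in Python as follows. The class name `Core`, its attribute names, and the push order are assumptions made for this sketch and are not taken from the disclosure; a Python list merely stands in for the processor stack.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Toy model of a core; names and layout are hypothetical."""
    pc: int = 0        # program counter
    upc: int = 0       # micro-program counter
    status: int = 0    # status register
    stack: list = field(default_factory=list)  # stands in for the processor stack

    def push_minimal_context(self) -> None:
        # Bookmark the switch point with three values only
        # (the push order is an assumption of this sketch).
        self.stack.extend([self.pc, self.upc, self.status])

    def pop_minimal_context(self) -> None:
        # Restore in the reverse order of the pushes.
        self.status = self.stack.pop()
        self.upc = self.stack.pop()
        self.pc = self.stack.pop()


core = Core(pc=0x100, upc=0x20, status=0x1)
core.push_minimal_context()              # bookmark the first thread
core.pc, core.upc, core.status = 0x800, 0, 0   # second thread runs
core.pop_minimal_context()               # resume at the switch point
```

In this sketch only three values cross the stack regardless of how many other registers the core has, which is the source of the time and power saving discussed above.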

The processor described herein is particularly suited for executing Java™ Bytecodes or comparable code. As is well known, Java is particularly suited for embedded applications. Java is a stack-based language, meaning that a processor stack is heavily used when executing various instructions (e.g., Bytecodes), which instructions generally have a size of 8 bits. Java is a relatively “dense” language meaning that on average each instruction may perform a large number of functions compared to various other instructions. The dense nature of Java is of particular benefit for portable, battery-operated devices that preferably include as little memory as possible to save space and power. The reason, however, for executing Java code is not material to this disclosure or the claims which follow. Further, the processor advantageously includes one or more features that permit the execution of the Java code to be accelerated.

Referring now to FIG. 1, a system 100 is shown in accordance with a preferred embodiment of the invention. As shown, the system includes at least two processors 102 and 104. Processor 102 is referred to for purposes of this disclosure as a Java Stack Machine (“JSM”) and processor 104 may be referred to as a Main Processor Unit (“MPU”). System 100 may also include memory 106 coupled to both the JSM 102 and MPU 104 and thus accessible by both processors. At least a portion of the memory 106 may be shared by both processors meaning that both processors may access the same shared memory locations. Further, if desired, a portion of the memory 106 may be designated as private to one processor or the other. System 100 also includes a Java Virtual Machine (“JVM”) 108, compiler 110, and a display 114. The MPU 104 preferably includes an interface to one or more input/output (“I/O”) devices such as a keypad to permit a user to control various aspects of the system 100. In addition, data streams may be received from the I/O space into the JSM 102 to be processed by the JSM 102. Other components (not specifically shown) may be included as desired for various applications.

As is generally well known, Java code comprises a plurality of “Bytecodes” 112. Bytecodes 112 may be provided to the JVM 108, compiled by compiler 110 and provided to the JSM 102 and/or MPU 104 for execution therein. In accordance with a preferred embodiment of the invention, the JSM 102 may execute at least some, and generally most, of the Java Bytecodes. When appropriate, however, the JSM 102 may request the MPU 104 to execute one or more Java Bytecodes not executed or executable by the JSM 102. In addition to executing Java Bytecodes, the MPU 104 also may execute non-Java instructions. The MPU 104 also hosts an operating system (“O/S”) (not specifically shown) which performs various functions including system memory management, the system task management that schedules the JVM 108 and most or all other native tasks running on the system, management of the display 114, receiving input from input devices, etc. Without limitation, Java code may be used to perform any one of a variety of applications including multimedia, games or web-based applications in the system 100, while non-Java code, which may comprise the O/S and other native applications, may still run on the system on the MPU 104.

The JVM 108 generally comprises a combination of software and hardware. The software may include the compiler 110 and the hardware may include the JSM 102. The JVM may include a class loader, Bytecode verifier, garbage collector, and a Bytecode interpreter loop to interpret the Bytecodes that are not executed on the JSM processor 102.

In accordance with preferred embodiments of the invention, the JSM 102 may execute at least two types of instruction sets. One type of instruction set may comprise standard Java Bytecodes. As is well-known, Java is a stack-based programming language in which instructions generally target a stack. For example, an integer add (“IADD”) Java instruction pops two integers off the top of the stack, adds them together, and pushes the sum back on the stack. A “simple” Bytecode instruction is generally one in which the JSM 102 may perform an immediate operation either in a single cycle (e.g., an “iadd” instruction) or in several cycles (e.g., “dup2_x2”). A “complex” Bytecode instruction is one in which several memory accesses may be required to be made within the JVM data structure for various verifications (e.g., NULL pointer, array boundaries). As will be described in further detail below, one or more of the complex Bytecodes may be replaced by a “micro-sequence” comprising various other instructions.
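For illustration, the stack behavior of the IADD instruction described above may be modeled in a few lines of Python. The function names `to_int32` and `iadd` are hypothetical, and a Python list stands in for the operand stack; the 32-bit wraparound reflects the behavior of Java `int` addition.

```python
def to_int32(value: int) -> int:
    """Wrap an arbitrary Python integer to a signed 32-bit value."""
    return ((value + 2**31) % 2**32) - 2**31


def iadd(stack: list) -> None:
    """Pop two integers, add them, and push the sum back on the stack."""
    b = stack.pop()
    a = stack.pop()
    stack.append(to_int32(a + b))


operands = [7, 2, 3]
iadd(operands)   # the top two entries (2 and 3) are replaced by their sum
```

The instruction names no operands; both inputs and the result are implied by the top of the stack, which is why such instructions can be encoded compactly.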

Another type of instruction set executed by the JSM 102 may include instructions other than standard Java instructions. In accordance with at least some embodiments of the invention, the other instruction set may include register-based and memory-based operations to be performed. This other type of instruction set generally complements the Java instruction set and, accordingly, may be referred to as a complementary instruction set architecture (“C-ISA”). By complementary, it is meant that a complex Java Bytecode may be replaced by a “micro-sequence” comprising C-ISA instructions. The execution of Java may be made more efficient and run faster by replacing some sequences of Bytecodes by preferably shorter and more efficient sequences of C-ISA instructions. The two sets of instructions may be used in a complementary fashion to obtain satisfactory code density and efficiency. As such, the JSM 102 generally comprises a stack-based architecture for efficient and accelerated execution of Java Bytecodes combined with a register-based architecture for executing register and memory based C-ISA instructions. Both architectures preferably are tightly combined and integrated through the C-ISA. Because various of the data structures described herein are generally JVM-dependent and thus may change from one JVM implementation to another, the software flexibility of the micro-sequence provides a mechanism for various JVM optimizations now known or later developed.

FIG. 2 shows an exemplary block diagram of the JSM 102. As shown, the JSM includes a core 120 coupled to data storage 122 and instruction storage 130. The core may include one or more components as shown. Such components preferably include a plurality of registers 140, three address generation units (“AGUs”) 142, 147, micro-translation lookaside buffers (micro-TLBs) 144, 156, a multi-entry micro-stack 146, an arithmetic logic unit (“ALU”) 148, a multiplier 150, decode logic 152, and instruction fetch logic 154. In general, operands may be retrieved from data storage 122 or from the micro-stack 146 and processed by the ALU 148, while instructions may be fetched from instruction storage 130 by fetch logic 154 and decoded by decode logic 152. The address generation unit 142 may be used to calculate addresses based, at least in part, on data contained in the registers 140. The AGUs 142 may calculate addresses for C-ISA instructions. The AGUs 142 may support parallel data accesses for C-ISA instructions that perform array or other types of processing. The AGU 147 couples to the micro-stack 146 and may manage overflow and underflow conditions in the micro-stack preferably in parallel. The micro-TLBs 144, 156 generally perform the function of a cache for the address translation and memory protection information bits that are preferably under the control of the operating system running on the MPU 104. The decode logic 152 comprises auxiliary registers 151.

Referring now to FIG. 3, the registers 140 may include 16 registers designated as R0-R15. In some embodiments, registers R0-R5 and R8-R14 may be used as general purpose (“GP”) registers usable for any purpose by the programmer. Other registers, and some of the GP registers, may be used for specific functions. For example, in addition to use as a GP register, register R5 may be used to store the base address of a portion of memory in which Java local variables may be stored when used by the current Java method. The top of the micro-stack 146 can be referenced by the values in registers R6 and R7. The top of the micro-stack 146 has a matching address in external memory pointed to by register R6. The values contained in the micro-stack 146 are the latest updated values, while their corresponding values in external memory may or may not be up to date. Register R7 provides the data value stored at the top of the micro-stack 146. Register R15 may be used for status and control of the JSM 102. At least one bit (called the “Micro-Sequence-Active” bit) in status register R15 is used to indicate whether the JSM 102 is executing a simple instruction or a complex instruction through a micro-sequence. This bit controls, in particular, which program counter is used (PC or μPC) to fetch the next instruction, as will be explained below.

Referring again to FIG. 2, as noted above, the JSM 102 is adapted to process and execute instructions from at least two instruction sets, at least one having instructions from a stack-based instruction set (e.g., Java). The stack-based instruction set may include Java Bytecodes. Java Bytecodes may pop data from the micro-stack 146, unless it is empty, and push data onto the micro-stack 146. The micro-stack 146 preferably comprises the top n entries of a larger stack that is implemented in data storage 122. Although the value of n may vary in different embodiments, in accordance with at least some embodiments, the size n of the micro-stack may be the top eight entries in the larger, memory-based stack. The micro-stack 146 preferably comprises a plurality of gates in the core 120 of the JSM 102. By implementing the micro-stack 146 in gates (e.g., registers) in the core 120 of the processor 102, access to the data contained in the micro-stack 146 is generally very fast, although any particular access speed is not a limitation on this disclosure.
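The relationship between the micro-stack and the larger memory-based stack may be illustrated with the following Python sketch. The class name `MicroStack` and the spill and refill policy (moving one entry at a time on overflow or underflow) are assumptions made for illustration; the disclosure states only that the micro-stack holds the top n entries and that overflow and underflow conditions are managed.

```python
from collections import deque


class MicroStack:
    """Toy model: the top n entries live in 'gates', the rest in memory."""

    def __init__(self, n: int = 8):
        self.n = n
        self.top = deque()   # fast entries; the right end is the top of stack
        self.memory = []     # remainder of the stack, held in data storage

    def push(self, value: int) -> None:
        if len(self.top) == self.n:
            # Overflow: spill the oldest fast entry to memory (assumed policy).
            self.memory.append(self.top.popleft())
        self.top.append(value)

    def pop(self) -> int:
        if not self.top:
            if not self.memory:
                raise IndexError("pop from empty stack")
            # Underflow: refill one entry from memory (assumed policy).
            self.top.append(self.memory.pop())
        return self.top.pop()


ms = MicroStack(n=8)
for value in range(10):
    ms.push(value)   # entries 0 and 1 spill to memory; 2..9 stay in the micro-stack
```

Under these assumptions, the combined structure still behaves as a single last-in, first-out stack; only the location of each entry changes.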

The ALU 148 adds, subtracts, and shifts data. The multiplier 150 may be used to multiply two values together in one or more cycles. The instruction fetch logic 154 generally fetches instructions from instruction storage 130. The instructions may be decoded by decode logic 152. Because the JSM 102 is adapted to process instructions from at least two instruction sets, the decode logic 152 generally comprises at least two modes of operation, one mode for each instruction set. As such, the decode logic unit 152 may include a Java mode in which Java instructions may be decoded and a C-ISA mode in which C-ISA instructions may be decoded.

The data storage 122 generally comprises data cache (“D-cache”) 124 and data random access memory (“D-RAM”) 126. Reference may be made to U.S. Pat. No. 6,826,652, filed Jun. 9, 2000 and U.S. Pat. No. 6,792,508, filed Jun. 9, 2000, both incorporated herein by reference. Reference also may be made to U.S. Ser. No. 09/932,794 (Publication No. 20020069332), filed Aug. 17, 2001 and incorporated herein by reference. The stack (excluding the micro-stack 146), arrays and non-critical data may be stored in the D-cache 124, while Java local variables, critical data and non-Java variables (e.g., C, C++) may be stored in D-RAM 126. The instruction storage 130 may comprise instruction RAM (“I-RAM”) 132 and instruction cache (“I-cache”) 134. The I-RAM 132 may be used for “complex” micro-sequenced Bytecodes or micro-sequences, as described below. The I-cache 134 may be used to store other types of Java Bytecode and mixed Java/C-ISA instructions.

As noted above, the C-ISA instructions generally complement the standard Java Bytecodes. For example, the compiler 110 may scan a series of Java Bytecodes 112 and replace a complex Bytecode with a micro-sequence as explained previously. The micro-sequence may be created to optimize the function(s) performed by the replaced complex Bytecodes.

FIG. 4 illustrates the operation of the JSM 102 to replace Java Bytecodes with micro-sequences. FIG. 4 shows some, but not necessarily all, components of the JSM. In particular, the instruction storage 130, the decode logic 152, and a micro-sequence vector table 162 are shown. The decode logic 152 receives instructions from the instruction storage 130 and accesses the micro-sequence vector table 162. In general and as described above, the decode logic 152 receives instructions (e.g., instructions 170) from instruction storage 130 via instruction fetch logic 154 (FIG. 2) and decodes the instructions to determine the type of instruction for subsequent processing and execution. In accordance with the preferred embodiments, the JSM 102 either executes the Bytecode from instructions 170 or replaces a Bytecode from instructions 170 with a micro-sequence as described below.

The micro-sequence vector table 162 may be implemented in the decode logic 152 or as separate logic in the JSM 102. The micro-sequence vector table 162 preferably includes a plurality of entries 164. The entries 164 may include one entry for each Bytecode that the JSM may receive. For example, if there are a total of 256 Bytecodes, the micro-sequence vector table 162 preferably comprises at least 256 entries. Each entry 164 preferably includes at least two fields: a field 166 and an associated field 168. Field 168 may comprise a single bit that indicates whether the instruction 170 is to be directly executed or whether the associated field 166 contains a reference to a micro-sequence. For example, a bit 168 having a value of “0” (“not set”) may indicate the field 166 is invalid and thus, the corresponding Bytecode from instructions 170 is directly executable by the JSM. Bit 168 having a value of “1” (“set”) may indicate that the associated field 166 contains a reference to a micro-sequence.

If the bit 168 indicates the associated field 166 includes a reference to a micro-sequence, the reference may comprise the full starting address in instruction storage 130 of the micro-sequence or a part of the starting address that can be concatenated with a base address that may be programmable in the JSM. In the former case, field 166 may provide as many address bits as are required to access the full memory space. In the latter case, a register within the JSM registers 140 is programmed to hold the base address and the vector table 162 may supply only the offset to access the start of the micro-sequence. Most or all JSM internal registers 140 and any other registers preferably are accessible by the main processor unit 104 and, therefore, may be modified by the JVM as necessary. Although not required, this latter addressing technique may be preferred to reduce the number of bits needed within field 166. At least a portion 180 of the instruction storage 130 may be allocated for storage of micro-sequences and thus the starting address may point to a location in micro-sequence storage 180 at which a particular micro-sequence can be found. The portion 180 may be implemented in I-RAM 132 shown above in FIG. 2.
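The two addressing alternatives described above may be sketched in Python as follows. The names `VectorEntry`, `micro_sequence_start`, and `OFFSET_BITS`, as well as the 16-bit offset width and the placement of the base address in the upper bits, are assumptions made for this sketch; the disclosure does not specify field widths or the exact concatenation.

```python
from dataclasses import dataclass
from typing import Optional

OFFSET_BITS = 16  # assumed width of field 166 when it holds only an offset


@dataclass(frozen=True)
class VectorEntry:
    bit_168: bool = False   # set: field 166 references a micro-sequence
    field_166: int = 0      # full start address, or an offset from a base


def micro_sequence_start(entry: VectorEntry,
                         base: Optional[int] = None) -> Optional[int]:
    """Return the micro-sequence start address, or None if the
    Bytecode is directly executable (bit 168 not set)."""
    if not entry.bit_168:
        return None
    if base is None:
        # Former case: field 166 carries the full starting address.
        return entry.field_166
    # Latter case: concatenate a programmable base with the offset.
    offset = entry.field_166 & ((1 << OFFSET_BITS) - 1)
    return (base << OFFSET_BITS) | offset


full = micro_sequence_start(VectorEntry(True, 0x20000040))
short = micro_sequence_start(VectorEntry(True, 0x0040), base=0x2000)
```

Both calls resolve to the same address in this sketch, but the second needs only 16 bits per table entry, which illustrates why the base-plus-offset form may be preferred.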

Although the micro-sequence vector table 162 may be loaded and modified in accordance with a variety of techniques, the following discussion includes a preferred technique. The vector table 162 preferably comprises a JSM resource that is addressable via a register 140. A single entry 164 or a block of entries within the vector table 162 may be loaded by information from the data cache 124 (FIG. 2). When loading multiple entries (e.g., all of the entries 164) in the table 162, a repeat loop of instructions may be executed. Prior to executing the repeat loop, a register (e.g., R0) preferably is loaded with the starting address of the block of memory containing the data to load into the table. Another register (e.g., R1) preferably is loaded with the size of the block to load into the table. Register R14 is loaded with the value that corresponds to the first entry in the vector table that is to be updated/loaded.

The repeated instruction loop preferably comprises two instructions that are repeated n times. The value n preferably is the value stored in register R1. The first instruction in the loop preferably performs a load from the start address of the block (R0) to the first entry in the vector table 162. The second instruction in the loop preferably adds an “immediate” value to the block start address. The immediate value may be “2” if each entry in the vector table is 16 bits wide. The loop repeats itself to load the desired portions of the total depending on the starting address.
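A minimal Python sketch of the repeat loop described above follows. It treats the value in R1 as the number of 16-bit entries to load, assumes little-endian storage, and assumes that the table index derived from R14 advances by one after each load; none of these details is stated in the disclosure, and the function name `load_vector_table` is hypothetical.

```python
import struct

ENTRY_BYTES = 2  # each vector table entry is 16 bits wide


def load_vector_table(memory: bytes, table: list, r0: int, r1: int, r14: int):
    """Copy r1 entries from memory, starting at byte address r0,
    into the table starting at index r14."""
    for _ in range(r1):
        # First instruction: load from the block address to the table entry
        # (little-endian byte order is an assumption of this sketch).
        (entry,) = struct.unpack_from("<H", memory, r0)
        table[r14] = entry
        r14 += 1            # assumed auto-increment of the table index
        # Second instruction: add the immediate value to the block address.
        r0 += ENTRY_BYTES
    return r0, r14


block = struct.pack("<4H", 0x8001, 0x0002, 0x8003, 0x0004)
vector_table = [0] * 256
load_vector_table(block, vector_table, r0=2, r1=2, r14=0x10)
```

In this example only the second and third 16-bit words of the block are copied, into table entries 0x10 and 0x11, which shows how the starting values in the registers select the portion of the table that is loaded.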

In operation, the decode logic 152 uses a Bytecode from instructions 170 as an index into micro-sequence vector table 162. Once the decode logic 152 locates the indexed entry 164, the decode logic 152 examines the associated bit 168 to determine whether the Bytecode is to be replaced by a micro-sequence. If the bit 168 indicates that the Bytecode can be directly processed and executed by the JSM, then the instruction is so executed. If, however, the bit 168 indicates that the Bytecode is to be replaced by a micro-sequence, then the decode logic 152 preferably changes this instruction into a “no operation” (NOP) and sets the micro-sequence-active bit (described above) in the status register R15. In another embodiment, the JSM's pipe may be stalled to fetch and replace this micro-sequenced instruction by the first instruction of the micro-sequence. Changing the micro-sequenced Bytecode into a NOP while fetching the first instruction of the micro-sequence permits the JSM to process multi-cycle instructions that are further advanced in the pipe without additional latency. The micro-sequence-active bit may be set at any suitable time such as when the micro-sequence enters the JSM execution stage (not specifically shown).

As described above, the JSM 102 implements two program counters: the PC and the μPC. The PC and the μPC are stored in auxiliary registers 151, which in turn are located in the decode logic 152. In accordance with a preferred embodiment, one of these two program counters is the active program counter used to fetch and decode instructions. The PC 186 may be the currently active program counter when the decode logic 152 encounters a Bytecode to be replaced by a micro-sequence. Setting the status register's micro-sequence-active bit causes the micro-program counter 188 to become the active program counter instead of the program counter 186. Also, the contents of the field 166 associated with the micro-sequenced Bytecode preferably are loaded into the μPC 188. At this point, the JSM 102 is ready to begin fetching and decoding the instructions comprising the micro-sequence. At or about the time the decode logic begins using the μPC 188, the PC 186 preferably is incremented by a suitable value to point the PC 186 to the next instruction following the Bytecode that is replaced by the micro-sequence. In at least some embodiments, the micro-sequence-active bit within the status register R15 may only be changed when the first instruction of the micro-sequence enters the execute phase of the JSM 102 pipe. The switch from PC 186 to the μPC 188 preferably is effective immediately after the micro-sequenced instruction is decoded, thereby reducing the latency.

The micro-sequence may end with a predetermined value or Bytecode from the C-ISA called “RtuS” (return from micro-sequence) that indicates the end of the sequence. This C-ISA instruction causes a switch from the μPC 188 to the PC 186 upon completion of the micro-sequence. Preferably, the PC 186 previously was incremented, as discussed above, so that the value of the PC 186 points to the next instruction to be decoded. The instruction may have a delayed effect or an immediate effect depending on the embodiment that is implemented. In embodiments with an immediate effect, the switch from the μPC 188 to the PC 186 is performed immediately after the instruction is decoded and the instruction after the RtuS instruction is the instruction pointed to by the address present in the PC 186.
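The interplay of the PC, the μPC, the micro-sequence-active bit, and the RtuS instruction described above may be illustrated by the following simplified Python model. It ignores pipelining and NOP insertion, uses a dictionary in place of the vector table, and represents micro-sequence instructions as strings; the class name `Decoder`, the instruction labels, and the Bytecode values in the example are all hypothetical.

```python
RTUS = "RtuS"  # marks the end of a micro-sequence


class Decoder:
    """Simplified model of switching between the PC and the uPC."""

    def __init__(self, vector_table: dict, micro_storage: list):
        self.vector_table = vector_table    # Bytecode -> micro-sequence start
        self.micro_storage = micro_storage  # micro-sequence instructions
        self.pc = 0
        self.upc = 0
        self.micro_sequence_active = False  # models the bit in status register R15
        self.trace = []

    def run(self, bytecodes: list) -> list:
        while self.micro_sequence_active or self.pc < len(bytecodes):
            if self.micro_sequence_active:
                instruction = self.micro_storage[self.upc]
                self.upc += 1
                if instruction == RTUS:
                    # Return from micro-sequence: the PC becomes active again.
                    self.micro_sequence_active = False
                else:
                    self.trace.append(("uPC", instruction))
            else:
                bytecode = bytecodes[self.pc]
                # The PC already points past this Bytecode before any switch.
                self.pc += 1
                if bytecode in self.vector_table:
                    self.upc = self.vector_table[bytecode]
                    self.micro_sequence_active = True
                else:
                    self.trace.append(("PC", bytecode))
        return self.trace


decoder = Decoder({0xB6: 0}, ["load", "store", RTUS])
executed = decoder.run([0x60, 0xB6, 0x60])
```

Because the PC is advanced before the switch to the μPC, no separate return address needs to be saved in this model; on RtuS, fetching simply resumes from the PC.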

As discussed above, one or more Bytecodes may be replaced with a micro-sequence or a group of other instructions. Such replacement instructions may comprise any suitable instructions for the particular application and situation at hand. At least some such suitable instructions are disclosed in U.S. Ser. No. 10/631,308 (Publication No. 20040024989), filed Jul. 31, 2003 and incorporated herein by reference.

Replacement micro-sequence instructions also may be used to bookmark switch points when switching code execution threads. Referring to FIG. 5, the line marked “T1” denotes a first thread T1 that is processed by the JSM 102. The thread T1 comprises a plurality of Bytecode instructions, a plurality of micro-sequence instructions, or some combination thereof. As previously explained, the instructions that are executed in thread T1 are retrieved from the instruction storage 130. More specifically, Bytecodes are retrieved from the Bytecode storage 170 and micro-sequence instructions are retrieved from micro-sequence storage 180.

While processing thread T1, the decode logic 152 may encounter a sequence of JSM instructions that causes the processing of thread T1 to be paused and the processing of a separate thread T2 to be initialized. This sequence is executed in thread T1 at or immediately prior to switch point 502. Execution of this sequence causes processing of thread T1 to stop, and processing of a separate thread T2 (denoted by line “T2”) to begin in order to perform some separate task in thread T2. In some embodiments, instead of comprising a sequence of instructions (hereinafter referred to as “switch instructions”) that explicitly performs a thread switch, thread T1 may comprise a sequence of instructions that calls an operating system (OS) call (e.g., thread yield( )), which OS call selects one of a plurality of threads to execute based on thread priorities as dictated by the OS. A thread switch also may be directly initialized by the OS. Specifically, if the OS is running on the MPU 104, the OS may use a sequence of MPU commands to initialize the thread switch.

Before the JSM 102 switches from processing thread T1 to processing thread T2, however, information pertaining to the switch point 502 (i.e., “context” information) is stored by being pushed onto a T1 stack 123 (e.g., a memory-based stack designated specifically for thread T1 and stored in storage 122, FIG. 2) of the JSM 102. In some embodiments, the context information may be pushed onto the micro-stack 146. The use of the term “hardware-controlled stack” below and/or in the claims may refer to the micro-stack 146, the T1 stack 123 or the T2 stack 125, which T1 stack 123 and T2 stack 125 may be used as a micro-stack (e.g., like micro-stack 146). Although the embodiments below are discussed in terms of the T1 stack 123 and/or the T2 stack 125, the scope of disclosure is not limited to the use of these particular stacks and other stacks (e.g., micro-stack 146) may be substituted for the T1 stack 123 and/or the T2 stack 125. Further, in preferred embodiments, the context information is a minimal amount of information, as described below.

Context information that is collected preferably comprises the values of the PC 186, μPC 188 and status register (register R15) as they are at the switch point 502. When the decode logic 152 encounters a sequence of switch instructions while processing thread T1, the sequence causes the execution of thread T1 to be halted at switch point 502, the context of switch point 502 to be saved, and the execution of thread T2 to be initialized. In some embodiments, commands sent from the MPU 104 may perform a function similar to that of a sequence of switch instructions.

Regardless of whether a switch from thread T1 to thread T2 is initialized by code in thread T1 or commands received from the MPU 104, the switching processes are similar. As described above, the execution of thread T1 is first halted. Once the JSM 102 has stopped processing thread T1, the JSM 102 is made to store the context of the switch point 502. The context of the switch point 502 preferably comprises the minimum amount of information necessary for the JSM 102 to resume processing thread T1 at switch point 502 after the JSM 102 has finished processing thread T2. The JSM 102 stores the context of the switch point 502 by retrieving the PC 186 and the μPC 188 from the auxiliary registers 151 and pushing them onto the T1 stack 123. The JSM 102 also retrieves the value of the status register R15 and pushes that value onto the T1 stack 123 as well. These three values (the PC 186, the μPC 188 and the status register R15) together comprise the minimum amount of information needed for the JSM 102 to resume processing thread T1 at switch point 502 after processing thread T2.

However, in some embodiments, it is preferable to also store a fourth value for efficiency purposes. Accordingly, the JSM 102 pushes a fourth value onto the T1 stack 123, where the fourth value is variable. For example, the fourth value may be one of the registers 140. The scope of disclosure is not limited to pushing the PC 186, μPC 188, status register and variable register onto the stack in any particular order, nor is the scope of disclosure limited to pushing these particular values onto the stack. As described above, any suitable number of values (e.g., a minimum amount of information) may be pushed onto the stack to store a context.
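
The minimum-context store described above may be sketched as follows. The sketch is illustrative only: it models the T1 stack 123 as a Python list, it assumes one particular push order (the disclosure is not limited to any order), and the function names and the returned index standing in for the stack pointer are hypothetical.

```python
def push_min_context(stack, pc, upc, status, extra=None):
    """Push the PC, uPC and status values, plus an optional fourth value.

    Returns the index of the topmost entry, which stands in here for the
    stack pointer (register R6). The push order is an assumption.
    """
    stack.extend([pc, upc, status])
    if extra is not None:
        stack.append(extra)
    return len(stack) - 1


def pop_min_context(stack, has_extra=False):
    """Pop the values in the reverse of the order used by push_min_context."""
    extra = stack.pop() if has_extra else None
    status = stack.pop()
    upc = stack.pop()
    pc = stack.pop()
    return pc, upc, status, extra
```

A pop that mirrors the push order restores exactly the values that were present at the switch point, which is all that the minimum context is intended to guarantee.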

In some embodiments, the switch instructions in the thread T1 may be 32-bit instructions that, when executed, call a subroutine or some other portion of code comprising instructions that store the context of the switch point 502 by pushing context values (e.g., PC, μPC, status register) onto the T1 stack 123. FIG. 6 shows an illustrative embodiment of such 32-bit instructions. Specifically, FIG. 6 shows a 32-bit instruction 599 that comprises information that describes the class of the 32-bit instruction 599 and further specifies the type of the instruction 599. For example, as shown in the figure, bits 31:28 describe the class of the instruction and bits 27:24 and bits 3:0 describe the particular type of instruction being used. Bits 27:24 and bits 3:0 may specify, for example, that the instruction is a minimum-context push instruction which, when executed, causes various context values to be pushed onto the stack, as described above. Bits 23:2 are not of significance and preferably do not contain arguments or other relevant data. Instead, bits 23:2 may contain placeholder values (e.g., “0” bits). The scope of disclosure is not limited to the use of instructions as shown in FIG. 6. Context values also may be pushed onto the T1 stack 123 by commands received from the MPU 104.
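
The field layout described above may be illustrated by the following sketch, which extracts the class field (bits 31:28) and the two type fields (bits 27:24 and bits 3:0) from a 32-bit word. The function names and the field values used in the usage note are hypothetical; the actual encodings of FIG. 6 are not reproduced here.

```python
def decode_fields(word):
    """Split a 32-bit instruction word into (class, type_hi, type_lo)."""
    if not 0 <= word <= 0xFFFFFFFF:
        raise ValueError("instruction word must fit in 32 bits")
    instr_class = (word >> 28) & 0xF  # bits 31:28
    type_hi = (word >> 24) & 0xF      # bits 27:24
    type_lo = word & 0xF              # bits 3:0
    return instr_class, type_hi, type_lo


def encode_fields(instr_class, type_hi, type_lo):
    """Build a word from the three fields, leaving all other bits as 0."""
    for field in (instr_class, type_hi, type_lo):
        if not 0 <= field <= 0xF:
            raise ValueError("each field is four bits wide")
    return (instr_class << 28) | (type_hi << 24) | type_lo
```

For example, with the hypothetical field values 0xA, 0x3 and 0x5, encode_fields( ) yields the word 0xA3000005, and decode_fields( ) recovers the same three fields from that word.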

Each thread has its own RAM base address for storing local variables used by that thread. The micro-sequence may contain instructions that, when executed, cause the JSM 102 to clean and invalidate the DRAM 126 to save the local variables being used by thread T1. More specifically, at least some of the contents of the DRAM 126 preferably are transferred to other areas of the storage 122, such as another DRAM (not specifically shown) that may be located in the storage 122. The DRAM 126 then is invalidated to clear space in the DRAM 126 for local variables that are used by thread T2. After the DRAM 126 has been cleaned and invalidated, the JSM 102 also may push the RAM base address onto the main stack, so that the local variables used by thread T1 may be retrieved for later use. Also, because each thread pushes and pops different values onto the micro-stack 146, the JSM 102 may further clean and invalidate the micro-stack 146 in order to preserve the entries of the micro-stack 146 and to clear the micro-stack 146 for use by thread T2. In at least some embodiments, the entries of the micro-stack 146 may be copied and/or transferred to the data cache 124. Further, in some embodiments, the JSM 102 may invalidate the current entries of the micro-stack 146, so that after a thread switch, the entries loaded into the micro-stack 146 replace the invalidated entries.
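
The clean-and-invalidate operation may be pictured with the following simplified sketch, in which the DRAM 126 and the other area of the storage 122 are both modeled as Python dictionaries. The function name and the choice to key the saved copy by the RAM base address are assumptions made for illustration only.

```python
def clean_and_invalidate(local_ram, backing_store, ram_base):
    """Save a thread's local variables, then clear the local RAM.

    local_ram models the DRAM 126; backing_store models another area of
    the storage 122. Keying the copy by ram_base is an assumption.
    """
    backing_store[ram_base] = dict(local_ram)  # clean: transfer the contents
    local_ram.clear()                          # invalidate: free the space


def reload_locals(local_ram, backing_store, ram_base):
    """Bring a thread's saved local variables back into the local RAM."""
    local_ram.clear()
    local_ram.update(backing_store[ram_base])
```

Under these assumptions, the RAM base address that is pushed onto the stack is sufficient to locate the saved local variables when the thread later resumes.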

After the PC 186, the μPC 188, the status register R15 and an optional fourth register have been pushed onto the T1 stack 123, the JSM 102 stores the stack pointer (i.e., register R6). The stack pointer may be defined as the address of the topmost entry on T1 stack 123 and may be stored in any suitable memory (e.g., storage 122). Once at least the PC 186, μPC 188, and the status register have been pushed onto the T1 stack 123, and once the stack pointer for the T1 stack 123 has been stored in memory, the context of switch point 502 has been stored.

Because the context has been stored, the JSM 102 is ready to switch from thread T1 to thread T2. Similar to thread T1, thread T2 comprises a plurality of instructions (e.g., Bytecodes, micro-sequences or a combination thereof). Like thread T1, thread T2 may be executed multiple times. However, each time processing switches from thread T1 to thread T2, as the context of thread T1 is stored from the JSM 102 onto a stack, so should the context of thread T2 be loaded from a stack onto the JSM 102. The context of thread T2 may be found on top of the T2 stack 125. The T2 stack 125 preferably is a memory-based stack, specifically designated for thread T2 and stored in the storage 122. The thread T2 context may have been pushed onto the T2 stack 125 at the end of a previous iteration in a substantially similar fashion to the context-saving process described above in relation to thread T1, or, alternatively, the thread T2 context may have been pushed onto the T2 stack 125 during the creation of the thread T2. The push also may have occurred during the last thread switch of thread T2.

Thus, to begin processing thread T2, the JSM 102 loads the stack pointer for T2 stack 125 from the storage 122 to register R6. The RAM base address is loaded from the T2 stack 125, thus loading the local variables for thread T2. The JSM 102 also loads the context of thread T2 from the T2 stack 125 onto the auxiliary registers 151 and/or the registers 140. In particular, the JSM 102 uses specific instructions to pop context values off of the T2 stack 125, where at least some of the specific instructions are indivisible. For example, a MCTXPOP instruction may be used to pop minimum context values off of the T2 stack 125. This MCTXPOP instruction, in at least some embodiments, is indivisible, mandatory for performing a context switch, and should not be preempted. In this way, the JSM 102 is initialized to the context of the previous iteration of thread T2. Thus, the JSM 102 effectively is able to resume processing where it “left off.” The JSM 102 decodes and executes thread T2 in a similar fashion to thread T1.

After thread T2 has been executed, the JSM 102 may resume processing thread T1 at switch point 502. To resume processing thread T1 at switch point 502, the JSM 102 loads the context information of thread T1 from the T1 stack 123. The JSM 102 loads the stack pointer of thread T1 from the storage 122 and into register R6. The JSM 102 then pops the RAM base address off of the T1 stack 123 and uses the RAM base address to load the local variables for thread T1. The JSM 102 also pops the status value, μPC 188 and the PC 186 off of the T1 stack 123. The JSM 102 stores the status value to the register R15 and stores the μPC 188 and the PC 186 to the auxiliary registers 151. In this way, the context information that is stored on top of the T1 stack 123 is popped off the stack and is used by the JSM 102 to return to the context of switch point 502. The JSM 102 may now resume processing thread T1 at switch point 502. The thread switch from thread T2 to thread T1 may be controlled by a sequence of code being executed in thread T2 or, alternatively, by commands sent from the MPU 104. The thread switching technique described above may be applied to any suitable pair of threads in the system 100.

FIG. 7 shows a flowchart summarizing the process used to switch from one thread to another thread. The process 600 may begin by processing thread T1 (block 602). The process 600 comprises monitoring for a sequence of code in thread T1, or commands from the MPU 104, that initialize a thread switch from thread T1 to thread T2 (block 604). If no such sequence is encountered or no such command is received from the MPU 104, the process 600 comprises continuing to process thread T1 (block 602). However, if such a sequence or MPU 104 command is encountered, then the process 600 comprises halting processing of thread T1 (block 606) and pushing either the full or minimum context to the T1 stack (block 608), as previously described.

The process 600 further comprises cleaning and invalidating the RAM (block 610), pushing the RAM base address onto the T1 stack (block 612), cleaning and invalidating the micro-stack (block 614), and storing the T1 stack pointer to any suitable memory (block 616). The context of thread T1 has now been saved. Before beginning to process thread T2, the context of thread T2 (if any) is to be loaded from the T2 stack. Specifically, the process 600 comprises loading the T2 stack pointer from memory (block 618), popping the RAM base address from the T2 stack (block 620), popping the full or minimum context from the T2 stack (block 622), and subsequently beginning processing of the thread T2 (block 624).
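
The save and load portions of process 600 may be summarized by the following sketch. It is a simplified software model rather than a description of the hardware: the class and method names are hypothetical, each thread stack is modeled as a Python list, the stored stack pointer is modeled as the list length, and the cleaning and invalidating of the RAM and of the micro-stack (blocks 610 and 614) are omitted.

```python
class JsmModel:
    """Simplified model of a minimum-context thread switch."""

    def __init__(self):
        self.pc = self.upc = self.status = self.ram_base = 0
        self.stacks = {}    # thread name -> list modeling that thread's stack
        self.saved_sp = {}  # thread name -> stored stack pointer (register R6)

    def create(self, thread, pc, ram_base):
        """Push an initial context when a thread is created."""
        self.stacks[thread] = [pc, 0, 0, ram_base]
        self.saved_sp[thread] = len(self.stacks[thread])

    def save(self, thread):
        stack = self.stacks.setdefault(thread, [])
        stack.extend([self.pc, self.upc, self.status])  # block 608
        stack.append(self.ram_base)                     # block 612
        self.saved_sp[thread] = len(stack)              # block 616

    def load(self, thread):
        stack = self.stacks[thread]
        if len(stack) != self.saved_sp[thread]:         # block 618
            raise RuntimeError("stored stack pointer does not match stack")
        self.ram_base = stack.pop()                     # block 620
        self.status = stack.pop()                       # block 622
        self.upc = stack.pop()
        self.pc = stack.pop()

    def switch(self, old, new):
        """Save the context of the old thread, then load the new thread's."""
        self.save(old)
        self.load(new)
```

In this model, switching from T1 to T2 and back again leaves the PC and the RAM base address of T1 exactly as they were at the switch point, while T2 resumes from wherever it was last paused.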

In the embodiments described above, a minimum context (i.e., PC, μPC, status register) is pushed onto a stack to bookmark a switch point. While storing the minimum-context is faster than storing the full-context (i.e., all registers in the JSM core), and storing the minimum context onto the stack is faster than moving registers from the JSM 102 to the D-RAM 126, in some embodiments, it may be desirable to perform a full-context store instead of a minimum-context store, for reasons previously described. Thus, in such embodiments, full contexts also may be stored and/or loaded, in which case most or all of the registers 140 as well as most or all of the auxiliary registers 151 are stored and/or loaded with each thread switch. For instance, in cases where one or more register values other than the PC 186, μPC 188, and status register are affected in a second thread, it may be desirable to store all register values via a full-context store. In such cases, the 32-bit instructions described above and shown in FIG. 6 may comprise data (e.g., in bits 31:24 and 3:0) that causes a full-context store to be performed. Similarly, a 32-bit instruction may comprise data that causes a full-context load to be performed. Further, as also described above, a full-context store and/or load may be initialized by a command from the MPU 104 instead of by code being executed in thread T1. A full-context store and/or load is performed in a similar manner to a minimum-context store and/or load, with the exception being a difference in the number of registers stored and/or loaded.

As explained above, the technique of storing contexts during thread switches may be used to service commands received by the JSM 102 from the MPU 104. For example, in performing a series of tasks, the MPU 104 may determine that delegating one or more tasks to the JSM 102 would be expeditious, so that the MPU 104 may allocate its resources to performing other tasks. In such a case, the MPU 104 sends a command to the JSM 102, instructing the JSM 102 to perform a particular task. The command is coupled with a parameter, which parameter preferably comprises the address of a micro-sequence. The JSM 102, upon receiving the command and the associated parameter, stores the parameter in a suitable storage unit, such as a register 140, an auxiliary register 151 or on any one of the stacks in the JSM 102. The JSM 102 then uses the parameter (i.e., the micro-sequence address) to locate the micro-sequence in the micro-sequence storage 180. Upon locating the micro-sequence, the JSM 102 retrieves the micro-sequence and executes the micro-sequence, thus obeying the command sent from the MPU 104. The micro-sequence preferably is pre-programmed into the micro-sequence storage 180.

In obeying the command from the MPU 104, the JSM 102 may be required to pause whatever task it is completing at the moment the command is received from the MPU 104. More specifically, the JSM 102 may be performing a particular task or executing a sequence of code in a first thread T1 when it is interrupted with the command from the MPU 104. In order for the JSM 102 to service the command, it must first pause the execution of the first thread T1 at a switch point and bookmark the switch point by storing the context of the first thread T1. The JSM 102 then may service the command from the MPU 104 in a second thread T2. Once the command from the MPU 104 has been serviced, the JSM 102 may resume execution at the switch point in the first thread T1 by retrieving the stored context of the first thread T1.

The JSM 102 stores contexts and retrieves contexts in a manner similar to that previously described. In particular, when storing the context, the JSM 102 stores either a full context or, preferably, a minimum context. When storing a full context, the JSM 102 pushes all available registers 140 (and optionally auxiliary registers 151) onto the T1 stack 123. When storing a minimum context, the JSM 102 pushes the PC 186, the μPC 188, the status register R15 and optionally a fourth register value onto the T1 stack 123. In either case, before shifting to the second thread T2, the JSM 102 also stores the value of the stack pointer (i.e., register R6) in any suitable memory (e.g., DRAM 126). As previously described, the JSM 102 stores the value of the stack pointer so that, when it is ready to resume executing thread T1, the JSM 102 is able to locate the context information that is on the T1 stack 123. Specifically, once the JSM 102 has serviced the command from the MPU 104 and is ready to resume executing thread T1 at the switch point, the JSM 102 uses the stack pointer to locate the context information on the T1 stack 123. Once the context information is located, the JSM 102 pops the context information off of the stack and stores the context information to the appropriate registers (e.g., registers 140 and/or auxiliary registers 151) in the JSM 102. The JSM 102 then may resume executing in thread T1.

This technique is summarized in FIG. 8. The process 800 shown in FIG. 8 begins with the MPU 104 determining to delegate a task to the JSM 102 (block 802). Accordingly, the MPU 104 delegates the task by sending to the JSM 102 a command along with a parameter (block 804). The command instructs the JSM 102 to use the parameter (i.e., an address of a micro-sequence) to find a corresponding micro-sequence and to process the micro-sequence. Once the JSM 102 receives the command from the MPU 104, the JSM 102 pauses processing of a current thread T1 at a switch point (block 806). The JSM 102 then stores the context of the thread T1 at the switch point (block 808). The JSM 102 then uses the parameter received from the MPU 104 to find the micro-sequence (block 810). As explained above, the parameter contains the address of this micro-sequence. The JSM 102 subsequently retrieves and executes the micro-sequence in a thread T2 (block 812). After executing the micro-sequence, the JSM 102 restores the context of the switch point in thread T1 (block 814). Finally, the JSM 102 may resume executing thread T1 at the switch point (block 816). Such a technique is not limited to commands received from the MPU 104. Instead, the JSM 102 may apply this technique to any task delegated to the JSM 102, such as an interrupt, exception routine, etc.
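
The JSM side of process 800 may be sketched as follows. The sketch is a software model under stated assumptions: the registers are modeled as a dictionary, the micro-sequence storage 180 is modeled as a dictionary mapping addresses to lists of callables, the micro-sequence-active bit is assumed to be bit 0 of the status value, and the function and key names are hypothetical.

```python
def service_command(regs, t1_stack, micro_seq_storage, address):
    """Service a delegated command and return to the switch point.

    regs holds 'pc', 'upc' and 'status'; address is the parameter that
    accompanies the command (the address of a micro-sequence).
    """
    # Blocks 806 and 808: pause thread T1 and push its minimum context.
    t1_stack.extend([regs["pc"], regs["upc"], regs["status"]])
    saved_sp = len(t1_stack)  # stands in for the stored stack pointer

    # Block 810: use the parameter to find the pre-programmed micro-sequence.
    micro_sequence = micro_seq_storage[address]

    # Block 812: execute the micro-sequence in a second thread T2.
    regs["upc"] = address
    regs["status"] |= 0x1  # assumed position of the micro-sequence-active bit
    results = [step() for step in micro_sequence]

    # Block 814: locate the context with the stored pointer and restore it.
    regs["pc"], regs["upc"], regs["status"] = t1_stack[saved_sp - 3:saved_sp]
    del t1_stack[saved_sp - 3:]

    # Block 816: thread T1 may now resume at the switch point.
    return results
```

Because the MPU 104 supplies only an address, the host side of the exchange reduces to sending a command and a parameter; the work of locating and executing the micro-sequence falls entirely on the JSM 102 in this model.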

System 100 may be implemented as a mobile cell phone 415 such as that shown in FIG. 9. As shown, the battery-operated, mobile communication device includes an integrated keypad 412 and display 414. The JSM processor 102 and MPU processor 104 and other components may be included in electronics package 410 connected to the keypad 412, display 414, and radio frequency (“RF”) circuitry 416. The RF circuitry 416 may be connected to an antenna 418.

Although the above embodiments have been described in the context of dual processor cores, the techniques described herein also are applicable to any number of processor cores. For example, the system 100 may comprise the MPU 104, the JSM 102, as well as at least one additional processor core. The host processor (i.e., the MPU 104) may delegate tasks to the JSM 102 as well as any of the additional processor cores, using the techniques described above.

While the preferred embodiments of the present invention have been shown and described, modifications thereof can be made by one skilled in the art without departing from the spirit and teachings of the invention. The embodiments described herein are exemplary only, and are not intended to be limiting. Many variations and modifications of the invention disclosed herein are possible and are within the scope of the invention. Accordingly, the scope of protection is not limited by the description set out above. Each and every claim is incorporated into the specification as an embodiment of the present invention.

1. An electronic device, comprising: a first processor; and a second processor coupled to the first processor and adapted to receive an address from the first processor, to pause execution of a first thread at a switch point, and to use said address to retrieve and execute a group of instructions in a second thread; wherein, prior to executing the group of instructions in the second thread, the second processor pushes onto a hardware-controlled stack data pertaining to the switch point, said data comprising information needed to resume execution of the first thread at the switch point.
2. The electronic device of claim 1, wherein the group of instructions is pre-programmed into the second processor.
3. The electronic device of claim 1, wherein the data comprises only a minimum amount of information needed to resume execution of the first thread at the switch point.
4. The electronic device of claim 3, wherein said data comprises no more than four registers.
5. The electronic device of claim 4, wherein three of said four registers consist of two program counters and a status of the second processor.
6. The electronic device of claim 3, wherein the minimum amount of information comprises contents of only those registers used by the group of instructions.
7. The electronic device of claim 1, wherein, prior to resuming execution of the first thread at the switch point, the second processor pops the data off of the stack and stores the data to registers in the second processor.
8. The electronic device of claim 1, wherein the second processor stores an address of a topmost stack entry to a memory prior to executing the second thread.
9. The electronic device of claim 1, wherein the first processor sends to the second processor a command with said address, and wherein the command causes the second processor to retrieve and execute the group of instructions.
10. The electronic device of claim 1, wherein the device is at least one of a battery-operated device or a mobile communication device.
11. A processor, comprising: decode logic adapted to receive from another processor an address of a group of instructions; and fetch logic coupled to the decode logic and adapted to fetch the group of instructions from storage; wherein the decode logic pauses processing of a first thread at a switch point and processes the group of instructions in a separate thread; wherein, prior to processing the group of instructions, the processor pushes onto a hardware-controlled stack data pertaining to the switch point, said data comprising contents of registers used by the group of instructions.
12. The processor of claim 11, wherein the group of instructions is pre-programmed into the processor.
13. The processor of claim 11, wherein the data comprises only contents of no more than four registers.
14. The processor of claim 13, wherein three of said four registers consist of two program counters and a status of the processor.
15. The processor of claim 11, wherein, prior to resuming execution of the first thread at the switch point, the processor pops the data off of the stack and stores the data to said registers in the processor.
16. The processor of claim 11, wherein the processor stores an address of a topmost stack entry to a memory prior to executing the separate thread.
17. The processor of claim 11, wherein the processor receives a command with said address, and wherein the command causes the processor to retrieve and execute the group of instructions.
18. A method of delegating a task from a first processor to a second processor, comprising: transferring an address of a group of instructions from the first processor to the second processor; pausing execution of a first thread in the second processor at a switch point; pushing data onto a stack, said data comprising contents of registers used by said group of instructions; retrieving said group of instructions using the address; executing said group of instructions in a second thread; and popping said data off of the stack and storing the data to said registers in the second processor.
19. The method of claim 18, wherein pushing data onto the stack comprises pushing a maximum of four registers onto the stack.
20. The method of claim 18, wherein pushing data onto the stack comprises pushing only contents of registers used by said group of instructions.
21. The method of claim 20, wherein pushing only contents of registers used by said group of instructions comprises pushing two program counters and a status of the second processor onto the stack.
22. The method of claim 18 further comprising transferring a command from the first processor to the second processor, wherein the command causes the second processor to retrieve and execute said group of instructions.